| I confirm that I have fully read and understood the assignment brief for this module. | Y |
| Name: Chung Yan | Surname: Yu |
| Submission Date | 19-10-25 |
| Word Count: whole assignment including codes | |
| Word Count: main body excluding abstract, references and supplementary materials |
This assignment is the result of my own work and includes nothing which is the outcome of work done in collaboration except as declared in the Preface and specified in the text. It is not substantially the same as any that I have previously submitted for a degree or diploma or other qualification at the University of Cambridge or any other university or similar institution, or that is being concurrently submitted, except as declared in the Preface and specific in the text. I further state that no substantial part of my Portfolio has already been submitted, nor is being concurrently submitted for any such degree, diploma or other qualification at the University of Cambridge or any other university or similar institution except as declared in the Preface and specified in the text.
| I confirm the statement of originality as above | Y |
Self-assessment is an important aspect of feedback literacy, which is, in turn, key to the development of expertise. As you proceed through the MSt Healthcare data science programme, we hope that you will make use of the following prompts to assess your own work on assignments. Specific assignment briefs will likely indicate which of these to address for which assessments, but, in general, we expect you to respond to one or two for each assignment on your course.
For each of the questions, do not spend too long answering – keep it brief. For each question you answer, limit yourself to no more than three items. And please remember, this is optional and developmental: these cover sheets are designed to create space for self-assessment and feedback dialogue, rather than additional assignment workload.
| Which permitted use of generative AI are you acknowledging? | Semantic search of literature and notes, outline creation, code debugging, output formatting and generation, informed feedback |
| Which generative AI tool did you use (name and version)? | Claude Code v2.0.3x (the version likely changed over usage), Claude Desktop v1.0.3x with PubMed Connector, M365 CoPilot |
| What did you use the tool for? | Searching for relevant journal publications, sanity check on coding strategies, collating notes in my Obsidian vault, debugging code by parsing error messages |
| How have you used or changed the generative AI’s output | E.g. my collated notes always stay in point form so that I write out the paragraphs in my own words. Feedback provided by the GenAI models are weighed and assessed before being acted on. Code generated are checked, e.g. it tried to run Levene test on a model fitted using residuals as the response, this was rectified back to the relevant response variable. |
Diffuse large B cell lymphoma (DLBCL) is the most common non-Hodgkin lymphoma, with prognosis influenced by demographic features (age, gender), integrated scoring systems such as the International Prognostic Index (IPI) and genetic alterations. Oncogenes MYC, BCL2, BCL6 emerged with strong DLBCL associations, and further studies have identified over 150 genetic drivers. Yet whether demographic factors modify these genetic markers’ prognostic effects remains understudied. This study analysed the gene expression data of 21 genes from 775 DLBCL patients to investigate whether age and gender modify gene expression-response associations. Ten genes demonstrated significant age-dependent expression levels (FDR adjusted p < 0.05), including BCL2, CCND2 and LMO2. However, hierarchical logistic regression models revealed only CCND2 showed significant interaction effects (Likelihood ratio test adjusted p = 0.006) with age when it comes to association with complete response. Older age groups showed significant increase (61-75 odds ratio p = 0.014; >75 odds ratio p = 0.001) in complete response probability with higher CCND2 expression, while younger patients (≤40, 41-60) showed the opposite insignificant trend. Gender revealed no significant effects (additive or modifying) across all genes. Only MYC gene expression level showed significant gene-only association (likelihood ratio test adjusted p = 0.010) with response completeness. While the modifying effects of CCND2 and age on prognosis could reconcile certain conflicting viewpoints in DLBCL, this dynamic requires further scrutiny and replication elsewhere. With only one out of 42 tested gene-demographic combination (21 genes × 2 demographic groups) showing interactive effects, demographic modifying effects remains a rarity amongst genetic markers. Thus a more conservative and universal approach with these genomic markers could still be the way forward.
Diffuse large B cell lymphoma (DLBCL) is the most common type of aggressive non-Hodgkin’s lymphoma (NHL)1,2, with a wide range of factors contributing to prognosis. For a comparable and more comprehensive pre-treatment prognosis model, the international prognostic index (IPI) emerged as a standardised model for DLBCL and other aggressive NHLs3, combining 5 factors: age, Ann Arbor stage classification, ECOG Performance Status, serum lactate dehydrogenase (LDH) level and the number of extranodal sites of disease. By pooling factors together, the IPI outperformed simple disease stage classification models such as Ann Arbor.4 went on to improve on the IPI, creating the National Comprehensive Cancer Network International Prognostic Index (NCCN-IPI) as an update to the IPI. A key innovation of the NCCN-IPI was using cubic splines to model the continuous factors of age and LDH levels as higher resolution categorical factors, modifying them from binary factors to quaternary and ternary factors respectively. As prognosis models improved, so did treatment. The first generation chemotherapy of cyclophosphamide, doxorubicin, vincristine, and prednisone (CHOP) had a cure rate of 30-35%5. With the addition of rituximab to the regimen (R-CHOP) and other advances, DLBCL becomes curable for more than 60% of patients6.
While age is incorporated into the IPI (and subsequent NCCN-IPI), other demographic factors such as gender are not, despite established evidence for their prognostic involvement. Males demonstrate higher hazard ratios7 and gender-associated pharmacokinetic differences in rituximab lead to worse treatment responses in males8. These demographic effects raise questions about whether risk factors perform uniformly across patient subgroups. Beyond demographic and clinical factors, a series of genetic alterations have emerged as strong prognostic markers: BCL29, BCL610 and MYC11. These changes are observed at both the genotypic and phenotypic levels, as translocation mutations and protein overexpression respectively12. Furthermore, patients with varying combinations of these alterations have shown worse responses to R-CHOP treatments, highlighting the clinical relevance of genomic profiling. Some of these genetic markers, such as BCL613 and MYC14, have shown association with patient age, adding another layer of interaction that could influence prognosis.13 further demonstrated that IRF4 translocations are associated with age, and along with BCL6 and other genetic drivers, collectively lose prognostic significance when age is incorporated in multivariate models. This indicates that demographic factors may not merely add to genomic risk additively, but could modify how these genomic drivers influence clinical outcomes.
With the advent of machine learning methods came attempts to construct models based on clinical and genomic data.15 performed whole exome and transcriptome sequencing in 1001 DLBCL patients, identifying 150 genetic drivers including BCL2, BCL6, and MYC alongside numerous newly characterised candidates. By combining genotypic alterations, gene expression data and cell-of-origin classification, they developed a genomic risk model that outperformed the IPI. However, the potential for demographic factors to modify these genomic associations remains understudied. Given that age and gender both demonstrate prognostic significance and influence treatment response, it remains unknown whether genomic risk markers perform uniformly across demographic subgroups. If these demographic factors modify how gene expression levels impact treatment response, such that the same gene expression levels in different gender or age groups lead to different prognostic outcomes, it would be crucial that these differences are identified and applied in future models to prevent inaccurate prognoses.
This study addresses this gap by investigating whether age and gender interact and modify the prognostic associations between gene expression markers and therapy response in DLBCL. The associations between the demographic factors (age & gender) with complete therapy response and the gene expression levels of 21 genes were first considered. Logistic regression models using gene expression levels of 21 genes were fitted with therapy response, and the interaction of demographic factors with these gene expression levels were examined. Exploring how demographic factors interact with known and potential genomic drivers of DLBCL could reveal age or gender-dependent tumour biology that inform therapeutic approaches.
Here the same datasets16 from the genomic risk model study by15 are utilised, specifically their clinical information dataset from Supplementary Table S1 and gene expression dataset from Supplementary Table S2. The clinical information dataset comprises 1001 DLBCL patients with 35 variables. Patients are recorded with anonymised IDs with the variables of interest to us being gender, response to initial therapy, age at diagnosis, and the expression level of MYC, BCL2 and BCL6 genes based on log2 transformation of RNA-sequencing Fragments per kilobase (FPKM) value. These columns are highlighted and used in further analysis not only due to their relevance, but also their relative completeness. The gene expression dataset comprises the gene expression level (log2 transformed again) of 19 genes (1 gene BCL6 is also recorded in the clinical information dataset) for 775 patients recorded using the same anonymised IDs.
The datasets are available via Elsevier’s open
access license, as shown on the ScienceDirect webpage
of15’s publication. For their
work, the authors obtained anonymised patient clinical information and
tumours, which were processed in line with a protocol approved by the
Institutional Review Board at Duke University. A total number of 2
datasets were downloaded as Excel files directly from their links on the
aforementioned ScienceDirect, Table
S1 and Table
S2 using the download.file function from R’s
utils package17 and read via the
readxl package18. The specific sheets were
selected then cleaned to be exported as .csv files (see 01-data-preprocessing.R20). The
processed .csv files were uploaded to this project’s GitHub
repository: mmc1-ClinicalInformation.csv
(raw), mmc2-GeneExpression.csv
(raw). This RMarkdown document is self-sufficient and contains code
for the relevant analysis and plots shown only, for a complete analysis
including assumption testing, post-hoc analysis, tests that showed
insignificant results, please refer to the supplementary-materials
folder of this repository. This document also supports sourcing the
datasets from 3 sources for redundancy: original ScienceDirect links,
processed .csv files on GitHub and local processed
.csv files when this repository is downloaded. For this
knitted21 HTML document, both datasets were
loaded directly from ScienceDirect’s supplementary materials: S1
Table (clinical data) and S2
Table (gene expression data).
The 2 datasets were joined using common patient IDs and 2 new columns
were added. age_group_nccn was obtained by transforming the
age at diagnosis variable into 4 categorical groups according to cutoffs
defined by4’s NCCN-IPI: \(≤40, 41 - 60, 61 - 75, >75\).
complete_response was added by combining “Partial response”
and “No response” as “Incomplete response” from
response_therapy while keeping “Complete response”
unchanged. This prepared the dataset for hierarchical binomial logistic
regression. As for duplicate gene expression levels for BCL6, the two
columns were compared to verify identicality and 1 was discarded. The
resulting table was one with 775 patients and 27 variables, of which
include the anonymised patient ID; demographic variables such as gender
(binary: F/M), age (continuous, at diagnosis) and age group (quaternary,
as defined by NCCN-IPI); clinical information of therapy response
(ternary: Complete response, Partial response, No response); and the
log2 transformed gene expression levels of 21 genes: MYC, BCL2, BCL6,
ITPKB, MME, MYBL1, DENND3, NEK6, LMO2, LRMP, SH3BP5, IRF4, PIM1, ENTPD1,
BLNK, CCND2, ETV6, FUT8, BMF, IL16 and PTPN1.
Gene expression data were limited to these 21 genes available from16’s datasets (Supplementary Tables S1 & S2), representing 2 categories. The first category is the “big 3” oncogenic drivers of MYC, BCL2 and BCL6, serving as strong prognostic markers. Moreover, MYC14 and BCL613 have both shown age-dependent prognostic effects, suggesting age may modify how these markers influence clinical outcomes. The second category comprises 19 cell-of-origin classifier genes (DLBCL subtypes: ABC vs GCB)22, with BCL6 appearing in both categories. Despite their initial utility for classification, gene members such as CCND2 and LMO223 have shown to be strong survival predictors alongside BCL2 and BCL6. More importantly, IRF4 is among the genetic features showing age associations and contributes to genetic complexity that loses prognostic significance when age is incorporated13, alongside BCL6. Both categories therefore contain genes with potential to interact with demographic factors, making them suitable for testing whether demographic factors modify their prognostic associations.
Overall, data completeness for these 27 columns was exceptional, with
23 columns being 100%, and the remaining 4 columns all with more than
93% completeness. Processing is documented in 02-data-processing.R.
Characteristics of the cohort are summarised24 in
Table 1.
Table 1. Baseline Cohort Characteristics
| Characteristic | Total (N=775) | Complete Response | Incomplete Response |
|---|---|---|---|
| Sample size | N = 775 | n = 598 | n = 126 |
| Gender | |||
| Female | 338 (43.6%) | 258 (43.1%) | 54 (42.9%) |
| Male | 437 (56.4%) | 340 (56.9%) | 72 (57.1%) |
| Age at diagnosis, years | 61 ± 15.4 | 59.8 ± 15.7 | 63.1 ± 14.3 |
| Age groups (NCCN-IPI) | |||
| ≤40 years | 75 (10.1%) | 67 (11.5%) | 8 (6.5%) |
| 41-60 years | 249 (33.5%) | 202 (34.8%) | 41 (33.1%) |
| 61-75 years | 285 (38.3%) | 218 (37.5%) | 47 (37.9%) |
| >75 years | 135 (18.1%) | 94 (16.2%) | 28 (22.6%) |
Demographic and clinical characteristics stratified by treatment
response.
Values shown as n (%) for categorical variables and mean
± SD for continuous variables.
Treatment response categories
combine partial and no response as “Incomplete response” versus complete
response.
NCCN-IPI age groups represent National Comprehensive
Cancer Network International Prognostic Index classifications.
Percentages are calculated within each response group for gender and age
group distributions.
First, the variation of gene expression levels of the 21 genes in
different gender and age groups were scrutinised using the Student’s
t-test and one-way Welch’s ANOVA25 respectively, and with false
discovery rate (FDR) adjustment for the p-values26 due to the
number of genes analysed (see 04-data-analysis.R).
Post-hoc Games-Howell tests were then conducted for significant ANOVA
results27. Assumptions for these tests
were checked with diagnostic tests and plots, and the overall
distribution of gene expression against age and gender visually examined
(see 03-data-exploration.R).
Second, the association of therapy response completeness with gender and
age groups were studied with the Chi-squared tests with expected
frequencies checked to be \(> 5\) in
05-response-analysis.R.
Finally, to study potential interactions between demographic factors and gene expression levels on predicting therapy response, hierarchical logistic regression models28 were fitted to systematically examine the effects of these factors and their interactions. The models were nested as such:
The fitted models were then analysed with the likelihood-ratio test
(LRT) to obtain FDR-corrected p-values. Significant models were then
checked for model fit with goodness-of-fit with Chi-square tests and
Akaike Information Criterion (AIC) scores, along with influential
observations and collinearity29 (see 06-effect-analysis.R.)
Gene expression levels showed no significant difference across gender groups after t-test with FDR correction, indicating gender does not significantly affect baseline gene expression of these genetic markers. Conversely, 10 genes showed significantly different gene expression activity after Welch’s ANOVA with FDR correction across NCCN-IPI age groups: BCL2 (p = 0.0002), BLNK (p = 0.003), MYBL1 (p = 0.010), SH3BP5 (p = 0.016), ITPKB (p = 0.020), CCND2 (p = 0.020), PTPN1 (p = 0.031), BMF (p = 0.034), FUT8 (p = 0.037), LMO2 (p = 0.038). Post-hoc analysis with Games-Howell revealed that age groups mostly clustered into two statistially distinct groups for each gene, save for BLNK with three groups Figure 132. This points to age-dependent tumour biology at the transcription level.
Figure 1. Gene Expression Patterns Across Age Groups.
Sample sizes (n) are shown below each age group.
Outliers are
highlighted in darker borders.
Neither gender (p = 1.000) nor age group (p = 0.173) showed significant associations with complete response, although incomplete response proportions did trend upwards as age increased. While neither demographic factors showed prognostic potential on this clinical outcome, whether they modify the gene expression-therapy response relationship is to be explored in the next section.
As a part of fitting hierarchical models for each gene with either age or gender groups, gene-only models on predicting therapy response were also fitted (Figure 2). This revealed that one gene out of 21, MYC, displayed significant gene-only effects (LRT FDR-adjusted p-value = 0.010, AIC = 647.57, Chi-square goodness-of-fit p-value = 0.947), with higher expression predicting lower complete response probability. With the lowest AIC score among the 8 MYC models, adding demographic or demographic-MYC intereaction did not improve model fit, indicating MYC’s prognostic effect is consistent across age and gender groups (Figure 3a).
Figure 2. Gene-only Effects on Complete Response
Fig. 2
Orange: Adjusted significant
Blue:
Adjusted insignificant
Lilac: Insignificant
Confidence interval:
95%
MYC is the only gene showing significant gene effect after FDR
correction, being well clear of the 1.0 reference (no effect) reference
line even when considering its CI error bars.
Systematic testing of 21 genes across both age and gender yielded 42 gene-demographic models, of which no gender model showed significant effect. Neither gene-gender interaction nor gender main effect models attained significant results for any gene after FDR adjustment. For age, only the CCND2 gene showed significant (LRT FDR-adjusted p-value = 0.006, AIC = 643.12, Chi-square goodness-of-fit p-value = 0.973) effects amongst the interaction models of all 21 genes, and the rest 20 genes showed neither main (additive) nor interactive effects with age. Having the lowest AIC score (min delta = 12.7) amongst the 4 CCND2 models indicate that response completeness is best predicted with gene expression when its interaction with age is factored in.
Clear age-dependent directional effects emerge for CCND2 (Figure 3b), with higher CCND2 expression levels affecting the response completeness of the age groups differently. The curves for age groups ≤40 and 41-60 and the curves for age groups 61-75 and >75 are going in opposite directions. Complete response probability decreases with CCND2 expression level for the 2 younger groups, while complete response probability increases with CCND2 expression level for the 2 older groups. Further scrutinising the odds ratio33 of the 4 age groups (Figure 4) confirms that both older age groups showed significantly higher complete response odds (61-75 p = 0.014; >75 p = 0.001), and younger age groups (≤40, 41-60) have trends for worse but statistically insignificant odds.
Figure 3. Logistic Regression Models - Gene Expression Effects on Complete Response
Fig. 3
A. Logistic regression
plot of MYC gene expression against complete response probability,
age-interaction model
B. Logistic regression plot
of CCND2 gene expression against complete response probability,
age-interaction model
The MYC gene showed no significant
age-interaction effects, contrasting the curves of the different age
groups for CCND2.
Figure 4. CCND2 Interaction Effects - Age Group Modifying Gene Expression Effects on Complete Response
Fig. 4
Odds ratio of age groups
Orange:
Significant
Lilac: Insignificant
Confidence interval: 95%
Both
older age groups showed significant modifying effect.
While gender had displayed potential to influence clinical outcomes34, the NCCN-IPI did not include gender as the authors found gender did not substantially enhance their model’s predictive capability4. The lack of gender-dependent gene expression patterns and gene-modifying effects on prognosis in this analysis further confirms observation.
Of the big three oncogene drivers of DLBCL, only MYC showed significant (gene-only) effects. MYC’s inclusion was in a sense expected as its role as an oncogene is well established35. Whereas increase in BCL2 expression has been associated with inferior outcomes36 and BCL6 expression bodes well for prognosis38. Both genes however have significant LRT p-values when not corrected by FDR, BCL2’s gene-only model (unadjusted p = 0.039) and BCL6’s interaction model (unadjusted p = 0.037), so perhaps their effects were merely not significant enough for this dataset.
Connecting the two genes of significance of this study, MYC and CCND2, is that the oncoprotein product of MYC actually binds to the promoter of CCND2 in vivo39, yet within this dataset, the expression of these genes affect different age groups differently. The revelation of CCND2’s age modifying effects might actually provide insights to one of the more contested topics in the field, as there are conflicting opinions and associations when it comes to CCND2, where the activity of the gene could indicate better40 and worse41 prognoses. Obviously, more work needs to be conducted to establish this modifying effect of age-CCND2 on prognosis is robust and not spurious.
In the end, it has to be acknowledged that this study is not without flaws. The available dataset, especially gene expression data, was pre-selected for publication, not for the exploratory analysis of demographic effects on the prognostic powers of genomic drivers. The dataset also lacked ethnicity information, when it was likely patients from different continents were included. It would also be of great boon to the study to be able to benchmark the analysis here with Reddy’s genomic risk model. Unfortunately, the (hyper-)parameters required to rebuild such model along with the interactive webtool for this model are no longer accessible. This further highlights reproducibility in cancer research42 and bioinformatics43 in general, and make me appreciate the emphasis of robust replicability our lecturers had imparted on us during the first module.
This analysis of 775 DLBCL patients discovered 10 genes BCL2, BLNK, MYBL1, SH3BP5, ITPKB, CCND2, PTPN1, BMF, FUT8 and LMO2, showed age-dependent gene expression levels. One gene CCND2 showed modifying effects with age on predicting therapy outcomes, and the MYC gene showed gene only effects with age. The discovery of CCND2’s modifying effect could partially resolve the conflicting accounts of how CCND2 activity associate with prognosis. This is an exciting find that still needs work to replicate and validate.